We would like to understand who is most likely to be selected in speed dating, as well as which key drivers affect people's decisions when choosing a partner.
We used data from a speed dating experiment conducted at Columbia Business School, available on Kaggle. This is how the first 5 of the 2237 rows look:
| | Obs 1 | Obs 2 | Obs 3 | Obs 4 | Obs 5 |
|---|---|---|---|---|---|
| attr_o | 6 | 6 | 7 | 8 | 6 |
| sinc_o | 7 | 5 | 7 | 8 | 6 |
| intel_o | 8 | 10 | 7 | 9 | 7 |
| fun_o | 7 | 6 | 9 | 8 | 7 |
| amb_o | 7 | 6 | 9 | 8 | 8 |
| shar_o | 5 | 5 | 9 | 9 | 7 |
| field_cd | 1 | 1 | 1 | 1 | 1 |
| race | 2 | 2 | 2 | 2 | 2 |
| goal | 1 | 1 | 1 | 1 | 1 |
| date | 5 | 5 | 5 | 5 | 5 |
| go_out | 1 | 1 | 1 | 1 | 1 |
| career_c | 1 | 1 | 1 | 1 | 1 |
| sports | 1 | 1 | 1 | 1 | 1 |
| tvsports | 1 | 1 | 1 | 1 | 1 |
| exercise | 6 | 6 | 6 | 6 | 6 |
| dining | 7 | 7 | 7 | 7 | 7 |
| museums | 6 | 6 | 6 | 6 | 6 |
| art | 7 | 7 | 7 | 7 | 7 |
| hiking | 7 | 7 | 7 | 7 | 7 |
| gaming | 5 | 5 | 5 | 5 | 5 |
| clubbing | 7 | 7 | 7 | 7 | 7 |
| reading | 7 | 7 | 7 | 7 | 7 |
| tv | 7 | 7 | 7 | 7 | 7 |
| theater | 9 | 9 | 9 | 9 | 9 |
| movies | 7 | 7 | 7 | 7 | 7 |
| concerts | 8 | 8 | 8 | 8 | 8 |
| music | 7 | 7 | 7 | 7 | 7 |
| shopping | 1 | 1 | 1 | 1 | 1 |
| yoga | 8 | 8 | 8 | 8 | 8 |
We followed the approach below to proceed with the classification, as explained in class.
We split the data into three samples: estimation_data (80% of the data in our case), validation_data (10% of the data), and test_data (the remaining 10%). In our case we use 1789 observations in the estimation data, 224 in the validation data, and 224 in the test data.
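The split above can be sketched as follows. The original analysis was done in R; this is an illustrative Python equivalent, and the seed value is an arbitrary assumption:

```python
import random

def split_indices(n, seed=42):
    """Shuffle row indices and split them 80/10/10 into
    estimation, validation, and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # seed is an arbitrary choice
    n_est = int(n * 0.8)                   # 1789 of 2237 rows
    n_val = (n - n_est) // 2               # 224 rows
    est = idx[:n_est]
    val = idx[n_est:n_est + n_val]
    test = idx[n_est + n_val:]
    return est, val, test

est, val, test = split_indices(2237)
print(len(est), len(val), len(test))  # 1789 224 224
```

Shuffling before splitting matters: the raw data is ordered by session, so a non-random split would put entire dating waves into one sample.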
Our dependent variable is dec_o. It states whether a given subject was selected by the partner. In our estimation sample the number of 0/1's is as follows:
| | Class 1 | Class 0 |
|---|---|---|
| # of Observations | 875 | 914 |
while in the validation sample they are:
| | Class 1 | Class 0 |
|---|---|---|
| # of Observations | 118 | 106 |
Below are the statistics of our independent variables across the two classes. First, class 1, “selected”:
| | min | 25th percentile | median | mean | 75th percentile | max | std |
|---|---|---|---|---|---|---|---|
| attr_o | 2 | 6 | 7 | 7.38 | 8 | 10 | 1.46 |
| sinc_o | 0 | 7 | 8 | 7.60 | 9 | 10 | 1.48 |
| intel_o | 3 | 7 | 8 | 7.58 | 8 | 10 | 1.34 |
| fun_o | 0 | 6 | 7 | 7.23 | 8 | 10 | 1.53 |
| amb_o | 2 | 6 | 7 | 6.95 | 8 | 10 | 1.57 |
| shar_o | 0 | 5 | 7 | 6.41 | 8 | 10 | 1.86 |
| field_cd | 1 | 3 | 8 | 7.30 | 11 | 16 | 4.13 |
| race | 1 | 2 | 2 | 2.81 | 4 | 6 | 1.26 |
| goal | 1 | 1 | 2 | 1.93 | 2 | 6 | 1.27 |
| date | 1 | 4 | 6 | 5.24 | 6 | 7 | 1.39 |
| go_out | 1 | 1 | 2 | 2.11 | 3 | 7 | 0.99 |
| career_c | 1 | 2 | 4 | 5.18 | 9 | 15 | 3.47 |
| sports | 1 | 4 | 6 | 5.73 | 8 | 10 | 2.63 |
| tvsports | 1 | 2 | 3 | 3.96 | 6 | 10 | 2.52 |
| exercise | 1 | 5 | 7 | 6.43 | 8 | 10 | 2.52 |
| dining | 4 | 7 | 8 | 8.04 | 10 | 10 | 1.68 |
| museums | 2 | 7 | 7 | 7.36 | 9 | 10 | 1.84 |
| art | 2 | 6 | 7 | 7.06 | 9 | 10 | 2.15 |
| hiking | 1 | 4 | 7 | 6.04 | 8 | 10 | 2.46 |
| gaming | 1 | 1 | 2 | 3.17 | 5 | 9 | 2.22 |
| clubbing | 1 | 4 | 6 | 5.90 | 8 | 10 | 2.37 |
| reading | 2 | 7 | 8 | 7.87 | 9 | 10 | 1.94 |
| tv | 1 | 4 | 6 | 5.62 | 8 | 10 | 2.59 |
| theater | 1 | 6 | 8 | 7.48 | 9 | 10 | 2.13 |
| movies | 2 | 7 | 8 | 8.18 | 9 | 10 | 1.66 |
| concerts | 1 | 6 | 7 | 7.21 | 9 | 10 | 2.02 |
| music | 4 | 7 | 8 | 8.00 | 9 | 10 | 1.65 |
| shopping | 1 | 5 | 7 | 6.44 | 9 | 10 | 2.48 |
| yoga | 1 | 3 | 5 | 4.88 | 7 | 10 | 2.62 |
and class 0, “not selected”:
| | min | 25th percentile | median | mean | 75th percentile | max | std |
|---|---|---|---|---|---|---|---|
| attr_o | 0 | 4 | 6 | 5.50 | 7 | 10 | 1.71 |
| sinc_o | 0 | 6 | 7 | 6.96 | 8 | 10 | 1.73 |
| intel_o | 1 | 6 | 7 | 6.98 | 8 | 10 | 1.58 |
| fun_o | 0 | 5 | 6 | 5.77 | 7 | 11 | 1.89 |
| amb_o | 1 | 5 | 6 | 6.27 | 7 | 10 | 1.78 |
| shar_o | 0 | 3 | 5 | 4.79 | 6 | 10 | 2.03 |
| field_cd | 1 | 3 | 9 | 7.57 | 10 | 16 | 3.84 |
| race | 1 | 2 | 2 | 2.87 | 4 | 6 | 1.29 |
| goal | 1 | 1 | 2 | 2.19 | 2 | 6 | 1.43 |
| date | 1 | 4 | 6 | 5.34 | 6 | 7 | 1.47 |
| go_out | 1 | 1 | 2 | 2.30 | 3 | 7 | 1.22 |
| career_c | 1 | 2 | 4 | 5.04 | 7 | 15 | 3.36 |
| sports | 1 | 3 | 5 | 5.32 | 8 | 10 | 2.66 |
| tvsports | 1 | 2 | 4 | 4.17 | 6 | 10 | 2.60 |
| exercise | 1 | 4 | 6 | 6.12 | 8 | 10 | 2.63 |
| dining | 4 | 7 | 9 | 8.31 | 10 | 10 | 1.57 |
| museums | 2 | 7 | 8 | 7.60 | 9 | 10 | 1.96 |
| art | 2 | 6 | 8 | 7.35 | 9 | 10 | 2.11 |
| hiking | 1 | 3 | 6 | 5.77 | 8 | 10 | 2.64 |
| gaming | 1 | 1 | 2 | 3.06 | 5 | 9 | 2.35 |
| clubbing | 1 | 4 | 6 | 5.73 | 8 | 10 | 2.36 |
| reading | 2 | 7 | 9 | 7.97 | 10 | 10 | 2.00 |
| tv | 1 | 4 | 6 | 5.86 | 8 | 10 | 2.53 |
| theater | 1 | 7 | 8 | 7.75 | 9 | 10 | 1.91 |
| movies | 2 | 7 | 9 | 8.38 | 10 | 10 | 1.35 |
| concerts | 1 | 6 | 8 | 7.29 | 9 | 10 | 2.13 |
| music | 4 | 7 | 8 | 8.16 | 9 | 10 | 1.51 |
| shopping | 1 | 5 | 7 | 6.70 | 9 | 10 | 2.51 |
| yoga | 1 | 3 | 5 | 5.00 | 7 | 10 | 2.75 |
A simple visualization of the values using box plots is presented below. Box plots visually summarize the simple statistics of an independent variable (e.g. median, top and bottom quartiles, min, max). For example, for class 0
and class 1:
For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and random forests.
Running a basic CART model with complexity parameter cp = 0.01 leads to the following tree:
The key decision criteria can be explained by the following table:
| Attribute | Name |
|---|---|
| IV1 | attr_o |
| IV4 | fun_o |
| IV6 | shar_o |
| IV5 | amb_o |
| IV2 | sinc_o |
| IV3 | intel_o |
| IV7 | field_cd |
| IV11 | go_out |
One can estimate larger trees by changing the tree's complexity control parameter (in this case the rpart.control argument cp). For example, this is how the tree looks if we set cp = 0.005:
| Attribute | Name |
|---|---|
| IV1 | attr_o |
| IV4 | fun_o |
| IV6 | shar_o |
| IV5 | amb_o |
| IV2 | sinc_o |
| IV3 | intel_o |
| IV15 | exercise |
| IV13 | sports |
| IV7 | field_cd |
| IV24 | theater |
| IV22 | reading |
| IV23 | tv |
| IV26 | concerts |
| IV11 | go_out |
| IV21 | clubbing |
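Under the hood, CART grows the tree by repeatedly finding, at each node, the variable and threshold that best separate the two classes. A minimal sketch of one such split search, using Gini impurity as rpart does by default for classification (the toy data is illustrative, not taken from the dataset):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Find the threshold on a single variable that minimizes the
    weighted Gini impurity of the two resulting child nodes."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy example: attractiveness ratings vs. selection decision
x = [3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 1, 1, 1]
print(best_split(x, y))  # (5, 0.0): splitting at attr <= 5 separates perfectly
```

The cp parameter then prunes away splits whose impurity reduction is too small, which is why the cp = 0.005 tree above is larger than the cp = 0.01 one.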
Below we present the probability that validation observations belong to class 1. For the first few validation observations, using the first CART above, the probabilities are:
| | Actual Class | Probability of Class 1 |
|---|---|---|
| Obs 1 | 1 | 0.80 |
| Obs 2 | 0 | 0.26 |
| Obs 3 | 0 | 0.26 |
| Obs 4 | 1 | 0.80 |
| Obs 5 | 1 | 0.29 |
Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g. 0 or 1). Logistic regression estimates the coefficients of a linear model over the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -3.0 | 0.8 | -3.8 | 0.0 |
| attr_o | 0.7 | 0.0 | 14.1 | 0.0 |
| sinc_o | -0.1 | 0.1 | -2.4 | 0.0 |
| intel_o | -0.1 | 0.1 | -1.4 | 0.2 |
| fun_o | 0.3 | 0.1 | 5.2 | 0.0 |
| amb_o | -0.2 | 0.1 | -3.6 | 0.0 |
| shar_o | 0.3 | 0.0 | 6.9 | 0.0 |
| field_cd | 0.0 | 0.0 | -0.9 | 0.4 |
| race | 0.1 | 0.1 | 1.0 | 0.3 |
| goal | 0.0 | 0.1 | -0.9 | 0.4 |
| date | 0.0 | 0.0 | 0.8 | 0.4 |
| go_out | -0.1 | 0.1 | -1.0 | 0.3 |
| career_c | 0.0 | 0.0 | 0.2 | 0.8 |
| sports | 0.0 | 0.0 | 0.8 | 0.4 |
| tvsports | -0.1 | 0.0 | -2.3 | 0.0 |
| exercise | 0.0 | 0.0 | 0.0 | 1.0 |
| dining | 0.0 | 0.0 | -1.0 | 0.3 |
| museums | 0.0 | 0.1 | 0.1 | 0.9 |
| art | 0.0 | 0.1 | 0.1 | 0.9 |
| hiking | 0.0 | 0.0 | 0.8 | 0.4 |
| gaming | 0.0 | 0.0 | -0.4 | 0.7 |
| clubbing | 0.0 | 0.0 | -0.9 | 0.4 |
| reading | 0.0 | 0.0 | -0.8 | 0.4 |
| tv | 0.0 | 0.0 | 0.8 | 0.4 |
| theater | -0.1 | 0.0 | -1.9 | 0.1 |
| movies | -0.1 | 0.1 | -1.3 | 0.2 |
| concerts | 0.0 | 0.0 | 0.8 | 0.4 |
| music | 0.0 | 0.1 | -0.2 | 0.9 |
| shopping | 0.0 | 0.0 | -0.7 | 0.5 |
| yoga | 0.0 | 0.0 | -1.8 | 0.1 |
Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability that an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 (i.e. the subject is selected by the partner) for each observation. For the first few validation observations, using the logistic regression above, these probabilities are:
| | Actual Class | Probability of Class 1 |
|---|---|---|
| Obs 1 | 1 | 0.56 |
| Obs 2 | 1 | 0.18 |
| Obs 3 | 0 | 0.67 |
| Obs 4 | 0 | 0.68 |
| Obs 5 | 1 | 0.10 |
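The transformation from regression output to probability is the logistic (sigmoid) function. A minimal sketch, using the first few estimated coefficients from the table above and an illustrative set of partner ratings (the remaining near-zero coefficients are omitted for brevity):

```python
import math

def logistic_prob(coefs, intercept, x):
    """Transform the linear model output z = b0 + sum(b_i * x_i)
    into Pr(class 1) via the logistic function 1 / (1 + exp(-z))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

coefs = [0.7, -0.1, -0.1, 0.3, -0.2, 0.3]  # attr_o .. shar_o, from the table
x = [7, 8, 8, 7, 7, 5]                     # one illustrative set of ratings
p = logistic_prob(coefs, -3.0, x)
print(round(p, 4))  # about 0.92: this partner would very likely be selected
```

Because the sigmoid is monotone, a larger linear score always means a larger probability, so the coefficient signs in the table can be read directly as the direction of each variable's effect.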
The default decision is to classify each observation in the group with the highest probability - but one can change this choice, as we discuss below.
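The default classification rule can be sketched as follows, using the five validation probabilities shown above:

```python
def classify(probs, threshold=0.5):
    """Default rule: assign class 1 when Pr(1) exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

# probabilities for the five validation observations shown earlier
probs = [0.56, 0.18, 0.67, 0.68, 0.10]
print(classify(probs))       # [1, 0, 1, 1, 0]
print(classify(probs, 0.6))  # raising the threshold flips observation 1 to 0
```

Varying the threshold is exactly what generates the ROC, lift, and profit curves discussed below.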
Selecting the best subset of independent variables for logistic regression, a special case of the general feature-selection problem, is an iterative process in which both the significance of the regression coefficients and the performance of the estimated model on the validation data are used as guidance. A number of variations are tested in practice, each leading to different performance, which we discuss next.
In our case, we can see the relative importance of the independent variables using the variable.importance of the CART trees (see help(rpart.object) in R) or the z-scores from the output of logistic regression. For easier visualization, we scale all values between -1 and 1 (the scaling is done for each method separately - note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods we used here.
Random forests is the last method we used. The table below shows the key drivers of the classification according to each of the methods:
| | CART 1 | CART 2 | Logistic Regr. | Random Forests (mean decrease in accuracy) |
|---|---|---|---|---|
| attr_o | 1.00 | 1.00 | 1.00 | 1.00 |
| sinc_o | -0.28 | -0.28 | -0.17 | 0.18 |
| intel_o | -0.25 | -0.27 | -0.10 | 0.19 |
| fun_o | 0.57 | 0.57 | 0.37 | 0.48 |
| amb_o | -0.32 | -0.32 | -0.26 | 0.10 |
| shar_o | 0.40 | 0.40 | 0.49 | 0.56 |
| field_cd | 0.00 | -0.01 | -0.06 | 0.17 |
| race | 0.00 | 0.00 | 0.07 | 0.10 |
| goal | 0.00 | 0.00 | -0.06 | 0.12 |
| date | 0.00 | 0.00 | 0.06 | 0.14 |
| go_out | 0.00 | 0.00 | -0.07 | 0.11 |
| career_c | 0.00 | 0.00 | 0.01 | 0.14 |
| sports | 0.00 | 0.02 | 0.06 | 0.16 |
| tvsports | 0.00 | 0.00 | -0.16 | 0.09 |
| exercise | 0.00 | 0.00 | 0.00 | 0.15 |
| dining | 0.00 | 0.00 | -0.07 | 0.14 |
| museums | 0.00 | 0.00 | 0.01 | 0.16 |
| art | 0.00 | 0.00 | 0.01 | 0.14 |
| hiking | 0.00 | 0.00 | 0.06 | 0.15 |
| gaming | 0.00 | 0.00 | -0.03 | 0.10 |
| clubbing | 0.00 | 0.00 | -0.06 | 0.15 |
| reading | 0.00 | 0.00 | -0.06 | 0.12 |
| tv | 0.00 | 0.00 | 0.06 | 0.14 |
| theater | 0.00 | 0.00 | -0.13 | 0.15 |
| movies | 0.00 | 0.00 | -0.09 | 0.15 |
| concerts | 0.00 | 0.00 | 0.06 | 0.11 |
| music | 0.00 | 0.00 | -0.01 | 0.15 |
| shopping | 0.00 | 0.00 | -0.05 | 0.14 |
| yoga | 0.00 | 0.00 | -0.13 | 0.11 |
In general we do not see significant differences across the methods used, which makes sense.
Below is the percentage of observations correctly classified (the predicted class equals the actual class) at the 50% probability threshold, for the validation data:
| | Hit Ratio |
|---|---|
| First CART | 70.09 |
| Second CART | 70.09 |
| Logistic Regression | 70.09 |
| Random Forests | 66.52 |
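The hit ratio is simple to compute. A sketch, using an illustrative five-observation toy sample rather than the actual validation data:

```python
def hit_ratio(actual, predicted):
    """Percentage of observations whose predicted class matches the actual one."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100 * hits / len(actual)

# toy example (illustrative values only)
actual    = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0]
print(hit_ratio(actual, predicted))  # 20.0
```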
while for the estimation data the hit rates are:
| | Hit Ratio |
|---|---|
| First CART | 75.29 |
| Second CART | 75.91 |
| Logistic Regression | 75.46 |
| Random Forests | 99.83 |
A simple benchmark to compare a classification model's performance against is the Maximum Chance Criterion: the proportion of the largest class. For our validation data the largest group is people who were selected by their partner: 118 out of 224. So without doing any modeling at all, classifying every individual into the largest group would already give a hit rate of 52.68%.
In our case all methods exceed this benchmark.
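The benchmark can be computed directly from the validation class counts:

```python
def maximum_chance(labels):
    """Hit rate achieved by always predicting the most common class."""
    n1 = sum(labels)
    return 100 * max(n1, len(labels) - n1) / len(labels)

# validation sample: 118 ones ("selected"), 106 zeros ("not selected")
labels = [1] * 118 + [0] * 106
print(round(maximum_chance(labels), 2))  # 52.68
```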
The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For example, for the method with the highest validation-data hit rate above (among the logistic regression, the two CART models, and random forests), the confusion matrix for the validation data is:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | 73.73 | 26.27 |
| Actual 0 | 66.04 | 33.96 |
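A sketch of how such a row-normalized confusion matrix is computed, on an illustrative toy sample rather than the actual validation data:

```python
def confusion_matrix_pct(actual, predicted):
    """Row-normalized confusion matrix: for each actual class, the
    percentage of its observations predicted as 1 and as 0."""
    out = {}
    for cls in (1, 0):
        preds = [p for a, p in zip(actual, predicted) if a == cls]
        pred1 = 100 * sum(preds) / len(preds)
        out[cls] = (round(pred1, 2), round(100 - pred1, 2))
    return out

# toy example (illustrative values only)
actual    = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
print(confusion_matrix_pct(actual, predicted))
```

Each row sums to 100%, which is why the table above reports pairs like 73.73 / 26.27.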
Remember that each observation is classified by our model according to the probabilities Pr(0) and Pr(1) and a chosen probability threshold. Typically we set the probability threshold to 0.5 - so that observations for which Pr(1) > 0.5 are classified as 1’s. However, we can vary this threshold, for example if we are interested in correctly predicting all 1’s but do not mind missing some 0’s (and vice-versa) - can you think of such a scenario?
When we change the probability threshold we get different values of the hit rate, false positive and false negative rates, or any other performance metric. We can plot, for example, how the false positive versus true positive rates change as we alter the probability threshold, generating the so-called ROC curve.
The ROC curves for the validation data for all four methods are as follows:
What should a good ROC curve look like? A rule of thumb in assessing ROC curves is that the “higher” the curve, and hence the larger the area under the curve, the better. You may also select one point on the ROC curve (the “best” one for your purpose) and use its false positive/false negative performance (and the corresponding probability threshold) to assess your model. Which point on the ROC curve should we select?
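The points of an ROC curve can be computed by sweeping the threshold over every predicted probability. A minimal sketch on illustrative toy data:

```python
def roc_points(actual, probs):
    """(false positive rate, true positive rate) pairs obtained by
    sweeping the classification threshold over every predicted value."""
    pos = sum(actual)
    neg = len(actual) - pos
    pts = []
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p >= t)
        fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p >= t)
        pts.append((fp / neg, tp / pos))
    return pts

# toy example (illustrative values only)
actual = [1, 0, 1, 1, 0]
probs  = [0.9, 0.8, 0.7, 0.4, 0.2]
print(roc_points(actual, probs))
```

Lowering the threshold moves us up and to the right along the curve: we capture more true 1's at the cost of more false positives.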
By changing the probability threshold, we can also generate the so-called lift curve, which is useful in certain applications, e.g. marketing or credit risk. For example, consider capturing fraud by examining only a few transactions instead of every single one. In this case we want to examine as few transactions as possible while capturing as many frauds as possible. We can measure the percentage of all frauds we capture if we only examine, say, the top x% of cases in terms of Pr(fraud). If we plot these points [percentage of class 1 captured vs. percentage of all data examined] as we change the threshold, we get the lift curve.
The lift curves for the validation data for our four classifiers are the following:
What should a good lift curve look like? Notice that if we were to randomly examine transactions, the “random prediction” lift curve would be a straight 45-degree diagonal line (why?). So the further above this 45-degree line our lift curve is, the better the “lift”. Moreover, much like for the ROC curve, one can choose the probability threshold so that any point of the lift curve is selected. Which point on the lift curve should we select in practice?
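Computing a lift curve amounts to ranking observations by predicted probability and counting the class-1 cases captured in each top slice. A minimal sketch on illustrative toy data:

```python
def lift_points(actual, probs):
    """Percentage of all class-1 cases captured when examining only the
    top x% of observations ranked by predicted probability."""
    ranked = [a for _, a in sorted(zip(probs, actual), reverse=True)]
    total_pos = sum(ranked)
    pts, captured = [], 0
    for i, a in enumerate(ranked, start=1):
        captured += a
        pts.append((100 * i / len(ranked), 100 * captured / total_pos))
    return pts

# toy example (illustrative values only)
actual = [1, 0, 1, 1, 0]
probs  = [0.9, 0.8, 0.7, 0.4, 0.2]
print(lift_points(actual, probs))
```

The last point is always (100%, 100%): examining everything captures every class-1 case, which is why the lift curve always ends on the diagonal.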
Finally, we can generate the so-called profit curve, which we often use to make our final decisions. The intuition is as follows. Consider a direct marketing campaign, and suppose it costs $1 to send an advertisement and the expected profit from a person who responds positively is $45. Suppose you have a database of 1 million people to whom you could potentially send the ads. To what fraction of the 1 million people should you send ads (typical response rates are 0.05%)? To answer such questions we create the profit curve, generated by again changing the probability threshold for classifying observations: for each threshold value we measure the total expected profit (or loss) we would generate. This is simply equal to:
Total Expected Profit = (% of 1's correctly predicted) × (value of capturing a 1) + (% of 0's correctly predicted) × (value of capturing a 0) + (% of 1's incorrectly predicted as 0) × (cost of missing a 1) + (% of 0's incorrectly predicted as 1) × (cost of missing a 0)
Calculating the expected profit requires an estimate of the four costs/values: the value of capturing a 1 or a 0, and the cost of misclassifying a 1 as a 0 or vice versa.
Given the values and costs of correct classifications and misclassifications, we can plot the total expected profit (or loss) as we change the probability threshold, much like how we generated the ROC and lift curves. Here is the profit curve for our example, if we consider the following business profit and loss for the correctly classified as well as the misclassified subjects:
| | Predict 1 | Predict 0 |
|---|---|---|
| Actual 1 | 100 | -75 |
| Actual 0 | -50 | 0 |
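The expected-profit formula above can be sketched as a payoff-weighted average over the classified observations; the payoff matrix below is the one from the table, and the toy actual/predicted vectors are illustrative:

```python
def expected_profit(actual, predicted, payoff):
    """Average payoff per observation, given a payoff matrix
    keyed by (actual class, predicted class)."""
    total = sum(payoff[(a, p)] for a, p in zip(actual, predicted))
    return total / len(actual)

# payoff matrix from the table above
payoff = {(1, 1): 100, (1, 0): -75, (0, 1): -50, (0, 0): 0}

# toy example (illustrative values only)
actual    = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
print(expected_profit(actual, predicted, payoff))  # (100 - 75 - 50 + 0 + 100) / 5 = 15.0
```

Evaluating this quantity at every threshold (i.e. for every `predicted` vector produced by `classify` at that threshold) traces out the profit curve.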
Based on these profit and cost estimates, the profit curves for the validation data for our classifiers are:
We can then select the threshold that corresponds to the maximum expected profit (or minimum loss, if necessary).
Notice that to maximize expected profit we need the cost/profit for each of the four cases! This can be difficult to assess, so some sensitivity analysis of the cost/profit assumptions is typically needed: for example, we can generate different profit curves (worst case, best case, average case scenarios) and see how much the best achievable profit varies, and, most importantly, how our selection of the classification model and probability threshold varies, as these are what we ultimately need to decide.
Below are the hit ratios for all four methods on the test data:
| | Hit Ratio |
|---|---|
| First CART | 73.21 |
| Second CART | 73.21 |
| Logistic Regression | 69.20 |
| Random Forests | 70.98 |
The confusion matrix for the model with the best validation-data hit ratio above is:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | 69 | 31 |
| Actual 0 | 77 | 23 |
ROC curves for the test data:
Lift curves for the test data:
Finally, the profit curves for the test data, using the same profit/cost estimates as above: